How do the Labs work? What about lecture?

  • Lecture will be a mix of applied labs in R and theoretical lectures.
  • Sometimes lab and lecture will be intermixed. This is evolving based on the digital format.
  • Lecture is meant to provide the theoretical grounding and a basic understanding of the methods discussed.
  • Lab is meant to reinforce your understanding and provide a basic framework for applying the methods discussed.

How and what to turn in for the labs

  • Step 1: go to the Lab Website
    • Download all the labs, or just this week’s lab

Screen Capture of GitHub:

Screen Capture of File Structure location:

  • Step 2: open up the [lab name].Rproj (for the first lab, this is Lab_1.Rproj)
    • This will bring up an RStudio Window

Screen Capture of RStudio:

  • Step 3: Knit the document

Screen Capture of Knitting:

Screen Capture of resulting html file:

What about R?

  • R is a dynamic programming language, and basic computing skills are a necessary component of modern social science.
  • One of the key things that any social scientist needs in the current era is a basic understanding of how to program in a high-level language.
  • All labs, HWs, and lectures are going to have this kind of ‘flavor’ – you will have to master some basic programming skills to be proficient in modern data analysis.

Goals of this Class

  • Maximally
    • Experience with computational methods in R
    • Knowledge of GLMs for binary and categorical data
    • Knowledge of advanced methods of Propensity Score and Mixed Effects
    • Knowledge of bootstrap
  • Minimally
    • Be able to read and peer-review work by scholars using these techniques
    • Know how to write up a technical report

How should I approach the readings?

The readings are based on the following philosophy.

  • I have provided and will continue to provide a very comprehensive reading set.
    • Note these are meant to represent a resource for current and future research.
    • This means that you are not expected to read everything I have listed.
    • You are expected to learn how to skim what you need as you need it.
    • Some things will not make sense or will seem extraneous at this time. That is ok! You may find them useful later (this will often be true in graduate school!)
  • Take away:
    • Skim - see what you know and what you don’t know.
    • Dig into what you need. This may be reviews on how R works, why it works and what it can do. This may be a new statistical technique.
    • Everything we learn in this class can be implemented in R.
    • Learning how to use complex software means lots of little failures for long term success.
    • Make friends. Work in groups.

What to hand in for the lab?

Knit this R Markdown document into an html file (lab_week_1_part2.html). This is the file to turn in on Canvas.

What to get from Lab 1

  • Lab 1 - Part 1
    • This is a big document with lots of resources.
    • Charlie’s notes will be very useful for the HWs (though a bit idiosyncratic)
    • Feel free to ignore things that don’t seem useful, but pay close attention to the details on how R works, and to the pieces on MLE (we will start MLE next week).

Homeworks

  • Homeworks are hard.
  • Homeworks take time. Start early.
  • Homeworks cover multiple weeks of material. Do not worry if you are not sure how to start every question; if that is the case, we probably have not covered the material yet.
  • Homeworks are not meant to trick you. They are meant to teach how things work and why, and to test application. Keep this in mind.

R Review Activities

Below are four vectors of data:

n<-15
a<-LETTERS[sample(n,replace=TRUE)] ## n random letters (sampled with replacement)
b<-rpois(n,1)                      ## Poisson counts
c<-rep(TRUE,n)                     ## logical vector
d<-rbinom(n,1,.25)                 ## Bernoulli (0/1) draws

In the R window below create a list, matrix and data.frame object out of this data.

## Your R code here
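One possible sketch of a solution (a seed is added here so the draws are reproducible; this is one approach among several):

```r
## Rebuild the four vectors so this chunk stands alone (seed for reproducibility)
set.seed(1)
n <- 15
a <- LETTERS[sample(n, replace = TRUE)]
b <- rpois(n, 1)
c <- rep(TRUE, n)
d <- rbinom(n, 1, .25)

l  <- list(a = a, b = b, c = c, d = d)        ## a list keeps each vector's own type
m  <- cbind(a, b, c, d)                       ## a matrix coerces everything to character
df <- data.frame(a = a, b = b, c = c, d = d)  ## a data.frame keeps column types
```

Compare `str(l)`, `str(m)`, and `str(df)` to see how each container handles the mixed types.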

Tidyverse

Use tidyverse `filter` and `select` to choose only the rows with 1 in d, and the columns a and d.

## Your R code here
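A sketch of one approach, assuming the dplyr package is installed (the data is rebuilt here so the chunk stands alone):

```r
library(dplyr)  ## filter() and select() come from dplyr (part of the tidyverse)

set.seed(1)
n <- 15
df <- data.frame(a = LETTERS[sample(n, replace = TRUE)],
                 b = rpois(n, 1),
                 c = rep(TRUE, n),
                 d = rbinom(n, 1, .25))

df %>%
  filter(d == 1) %>%  ## keep only the rows where d is 1
  select(a, d)        ## keep only the columns a and d
```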

Review lm objects

The classic linear model is of the form,

\[y_i = \beta_0+\sum_{j=1}^k \beta_j x_{ij}\] When using R to fit lm objects we use the formula environment.
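A minimal sketch of the formula interface with simulated data (the variable names here are made up for illustration):

```r
## Simulate from y = 2 + 3x + noise, then fit with the formula "y ~ x"
set.seed(42)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)  ## the formula names the model; data comes from the environment
coef(fit)         ## estimates should land near (2, 3)
```

The left side of `~` names the response, the right side the predictors; an intercept is included by default.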

Let’s grab an example data set from the excellent UCLA Statistics Consulting website.

For our data analysis below, we will use the crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are state id (sid), state name (state), violent crimes per 100,000 people (crime), murders per 1,000,000 (murder), the percent of the population living in metropolitan areas (pctmetro), the percent of the population that is white (pctwhite), percent of population with a high school education or above (pcths), percent of population living under poverty line (poverty), and percent of population that are single parents (single). It has 51 observations. We are going to use poverty and single to predict crime. - UCLA

## why do I have foreign::read.dta? what is this?
cdata <- foreign::read.dta("https://stats.idre.ucla.edu/stat/data/crime.dta")

Data Summary

Data

head(cdata)
  sid state crime murder pctmetro pctwhite pcths poverty single
1   1    ak   761    9.0     41.8     75.2  86.6     9.1   14.3
2   2    al   780   11.6     67.4     73.5  66.9    17.4   11.5
3   3    ar   593   10.2     44.7     82.9  66.3    20.0   10.7
4   4    az   715    8.6     84.7     88.6  78.7    15.4   12.1
5   5    ca  1078   13.1     96.7     79.3  76.2    18.2   12.5
6   6    co   567    5.8     81.8     92.5  84.4     9.9   12.1

Plot 1

ggplot(data=cdata, aes(x=poverty, y=crime, group=1)) +
  geom_line()+
  geom_point()

Plot 2

ggplot(data=cdata, aes(x=single, y=crime, group=1)) +
  geom_line()+
  geom_point()

Regression Model

OLS

summary(ols <- lm(crime ~ poverty + single, data = cdata))

Call:
lm(formula = crime ~ poverty + single, data = cdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-811.14 -114.27  -22.44  121.86  689.82 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1368.189    187.205  -7.308 2.48e-09 ***
poverty         6.787      8.989   0.755    0.454    
single        166.373     19.423   8.566 3.12e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 243.6 on 48 degrees of freedom
Multiple R-squared:  0.7072,    Adjusted R-squared:  0.695 
F-statistic: 57.96 on 2 and 48 DF,  p-value: 1.578e-13

Residual Diagnostics

opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(ols, las = 1)

From these plots, we can identify observations 9, 25, and 51 as possibly problematic to our model. We can look at these observations to see which states they represent.

cdata[c(9, 25, 51), 1:2]
   sid state
9    9    fl
25  25    ms
51  51    dc

QQ Plots

qqPlot(ols) ## qqPlot() comes from the car package

[1]  9 25

Outliers

influencePlot(ols, id.n=3) ## influencePlot() comes from the car package

      StudRes        Hat      CookD
9   3.1632792 0.04831771 0.14258909
25 -4.1109749 0.12668898 0.61387212
49 -0.6227287 0.17838219 0.02842712
51  2.7961066 0.53602150 2.63625193
d1 <- cooks.distance(ols)
r <- MASS::stdres(ols) ## here is package::function again!
a <- cbind(cdata, d1, r)
## Pretty Print
## How do we make this a function?
data.frame(lapply(a[d1 > 4/51, ], function(x){
  if(is.numeric(x)){
    return(round(x, 3))
  }
  x}))
  sid state crime murder pctmetro pctwhite pcths poverty single    d1      r
1   1    ak   761    9.0     41.8     75.2  86.6     9.1   14.3 0.125 -1.397
2   9    fl  1206    8.9     93.0     83.5  74.4    17.8   10.6 0.143  2.903
3  25    ms   434   13.5     30.7     63.3  64.3    24.7   14.7 0.614 -3.563
4  51    dc  2922   78.5    100.0     31.8  73.1    26.4   22.1 2.636  2.616

Robust Regression

Robust regression is done by iterated re-weighted least squares (IRLS). The command for running robust regression is rlm in the MASS package. There are several weighting functions that can be used for IRLS. We are going to first use the Huber weights in this example. We will then look at the final weights created by the IRLS process. This can be very useful.
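To see the idea behind IRLS, here is a hand-rolled sketch with Huber weights on simulated data. This is a simplified teaching version, not what MASS::rlm does internally (rlm uses a robust scale estimate and convergence checks, among other refinements):

```r
## Simulated line with one injected outlier
set.seed(7)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
y[1] <- y[1] + 15                           ## outlier

k <- 1.345                                  ## the usual Huber tuning constant
w <- rep(1, length(y))                      ## start from OLS weights
for (i in 1:20) {
  fit <- lm(y ~ x, weights = w)             ## weighted least squares step
  r <- residuals(fit) / summary(fit)$sigma  ## crude standardized residuals
  w <- ifelse(abs(r) <= k, 1, k / abs(r))   ## Huber weight function
}
coef(fit)  ## near the true (1, 2)
w[1]       ## the outlier ends up heavily down-weighted
```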

summary(rr.huber <- MASS::rlm(crime ~ poverty + single, data = cdata))

Call: rlm(formula = crime ~ poverty + single, data = cdata)
Residuals:
    Min      1Q  Median      3Q     Max 
-846.09 -125.80  -16.49  119.15  679.94 

Coefficients:
            Value      Std. Error t value   
(Intercept) -1423.0373   167.5899    -8.4912
poverty         8.8677     8.0467     1.1020
single        168.9858    17.3878     9.7186

Residual standard error: 181.8 on 48 degrees of freedom
hweights <- data.frame(state = cdata$state, resid = rr.huber$resid, weight = rr.huber$w)
hweights2 <- hweights[order(rr.huber$w), ]
hweights2[1:15, ]
   state      resid    weight
25    ms -846.08536 0.2889618
9     fl  679.94327 0.3595480
46    vt -410.48310 0.5955740
51    dc  376.34468 0.6494131
26    mt -356.13760 0.6864625
21    me -337.09622 0.7252263
31    nj  331.11603 0.7383578
14    il  319.10036 0.7661169
1     ak -313.15532 0.7807432
20    md  307.19142 0.7958154
19    ma  291.20817 0.8395172
18    la -266.95752 0.9159411
2     al  105.40319 1.0000000
3     ar   30.53589 1.0000000
4     az  -43.25299 1.0000000

We can see that, roughly, as the absolute residual goes down, the weight goes up. In other words, cases with large residuals tend to be down-weighted. This output shows us that the observation for Mississippi will be down-weighted the most. Florida will also be substantially down-weighted. All observations not shown above have a weight of 1. In OLS regression, all cases have a weight of 1. Hence, the more cases in the robust regression that have a weight close to one, the closer the OLS and robust regression results will be.

Cluster SE

clse<-cluster.vcov(ols,cdata) ## cluster.vcov() comes from the multiwayvcov package
cat('Without Clustered SE\n')
Without Clustered SE
summary(ols)

Call:
lm(formula = crime ~ poverty + single, data = cdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-811.14 -114.27  -22.44  121.86  689.82 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1368.189    187.205  -7.308 2.48e-09 ***
poverty         6.787      8.989   0.755    0.454    
single        166.373     19.423   8.566 3.12e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 243.6 on 48 degrees of freedom
Multiple R-squared:  0.7072,    Adjusted R-squared:  0.695 
F-statistic: 57.96 on 2 and 48 DF,  p-value: 1.578e-13
cat('\nWith Clustered SE\n')

With Clustered SE
coeftest(ols,clse) ## coeftest() comes from the lmtest package

t test of coefficients:

              Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -1368.1887   215.3707 -6.3527 7.235e-08 ***
poverty         6.7874    10.0913  0.6726    0.5044    
single        166.3727    22.9567  7.2472 3.075e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Nonlinear Regression

The chapter on Nonlinear Regression in R by Fox & Weisberg is available online, here.

NLR Details

We can generalize the basic regression formula so that \(y\) is function of \(m(x,\theta)\) and some error term \(\epsilon\). \(m\) can now be some arbitrary function, where our model is, \[y = m(x,\theta)+\epsilon\] Fox & Weisberg begin by considering a logistic growth model for \(m\).

\[ y = \frac{\theta_1}{1+\exp[-(\theta_2+\theta_3 x)]}+\epsilon\]

A good first step toward understanding the proposed logistic growth model is to use the curve function in R.

Curve Fits

Theta1

theta<-c(1,1,1)
curve(theta[1]/(1+exp(-(-theta[2] + theta[3]*x))), from=-5, to=5, main="(b)")
abline(h=1/2, v=1, lty=2)

Theta2

theta<-c(1,1,2)
curve(theta[1]/(1+exp(-(-theta[2] + theta[3]*x))), from=-5, to=5, main="(b)")
abline(h=1/2, v=1, lty=2)

Theta10

theta<-c(1,1,10)
curve(theta[1]/(1+exp(-(-theta[2] + theta[3]*x))), from=-5, to=5, main="(b)")
abline(h=1/2, v=1, lty=2)

ThetaX

Try various \(\theta\) values and save them as extra tabs.

theta<-c(1,1,1)
curve(theta[1]/(1+exp(-(-theta[2] + theta[3]*x))), from=-5, to=5, main="(b)")
abline(h=1/2, v=1, lty=2)

About the curve

Changing the values of the parameters \(\theta = (\theta_1,\theta_2,\theta_3)\) stretches or shrinks the axes, and changes the rate at which the curve varies from its lower value at 0 to its maximum value.

Stylized Facts from F&W

  • If \(\theta_3 > 0\), then as \(x\) gets larger the term \(\exp[-(\theta_2+\theta_3 x)]\) gets closer to 0, and so \(m(x, \theta)\) will approach the value \(\theta_1\) as an asymptote.

  • Assuming logistic population growth therefore imposes a limit to population size.

  • If \(\theta_3 > 0\), as \(x\rightarrow -\infty\), the term \(\exp[-(\theta_2+\theta_3 x)]\) grows large without bound, and so the mean will approach 0.

  • The logistic growth curve is symmetric about the value of \(x\) for which \(m(x, \theta)\) is midway between 0 and \(\theta_1\).

  • It is not hard to show that \(m(x = -\frac{\theta_2}{\theta_3}, \theta) = \frac{\theta_1}{2}\), and so the curve is symmetric about the midpoint \(x = -\frac{\theta_2}{\theta_3}\).

  • The parameter \(\theta_3\) controls how quickly the curve transitions from the lower asymptote of 0 to the upper asymptote at \(\theta_1\), and is therefore a growth-rate parameter.

Fitting Nonlinear Regression

R has a function for fitting nonlinear regression, nls. F&W formalize the NLR model as follows,

\[ y = E[y|x]+\epsilon = m(x,\theta)+ \epsilon\]

This makes the assumption that \(E[y|x]\) (read as the expectation of y conditioned on x) depends on \(x\) through the *kernel* function \(m(x,\theta)\).

Typically, there are also assumptions made on \(\epsilon\)

  • F&W assume the errors (\(\epsilon\)) are independent with variance \(\frac{\sigma^2}{w}\).
  • Where the \(w\) are known nonnegative weights.
  • \(\sigma^2\) is an unknown variance to be estimated from the data.

Under these assumptions the nls function in R can be used to estimate \(\theta\) as the values that minimize the residual sum of squares,

\[S(\theta) = \sum w[y-m(x,\theta)]^2\]

Let’s consider the case where \(w=1\) and \(m(x,\theta)= \beta_0+\beta_1x\), then \[S(\beta) = \sum [y-(\beta_0+\beta_1x)]^2\]

  • To find the \(\theta\) of interest we simply want \(\arg \min_{\theta} \left\{ \sum [y-(\beta_0+\beta_1x)]^2 \right\}\)
    • This is an optimization problem!
  • Notice that there is no requirement that minimizing \(S(\theta)\) have a closed-form solution. In fact this will rarely be the case.
  • This will generally only be solvable numerically, by an iterative algorithm such as Gauss-Newton, which uses derivatives to solve the optimization problem.
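To make the arg-min view concrete, here is a sketch that minimizes \(S(\beta)\) numerically with optim and compares the result to lm’s closed-form answer (simulated data):

```r
## Minimize the residual sum of squares directly and compare to lm()
set.seed(123)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)

S <- function(beta) sum((y - (beta[1] + beta[2] * x))^2)  ## w = 1 case

opt <- optim(c(0, 0), S)  ## Nelder-Mead from a (0, 0) starting point
rbind(optim = opt$par, lm = coef(lm(y ~ x)))  ## the two rows should agree closely
```

For the linear case the closed form exists, so this is overkill; for a genuinely nonlinear \(m\), numeric optimization of this kind is the only option.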

Fitting the general nls model requires a bit more hand-holding than the usual ‘lm’ case.

  • The user needs to supply a starting point for the optimizer. This starting point can have a big effect on the final values.

Let’s look at the nls function

args(nls)
function (formula, data = parent.frame(), start, control = nls.control(), 
    algorithm = c("default", "plinear", "port"), trace = FALSE, 
    subset, weights, na.action, model = FALSE, lower = -Inf, 
    upper = Inf, ...) 
NULL

How do we find initial conditions? Typically you need to work with the proposed function; in the case of the growth model under consideration: \[ \begin{aligned} y &\approx \frac{\theta_1}{1+\exp[-(\theta_2+\theta_3 x)]}\\ \frac{y}{\theta_1} &\approx \frac{1}{1+\exp[-(\theta_2+\theta_3 x)]}\\ \frac{1}{y/\theta_1} &\approx 1+\exp[-(\theta_2+\theta_3 x)]\\ \frac{1}{y/\theta_1}-1 &\approx \exp[-(\theta_2+\theta_3 x)]\\ \frac{1-y/\theta_1}{y/\theta_1} &\approx \exp[-(\theta_2+\theta_3 x)]\\ \frac{y/\theta_1}{1-y/\theta_1} &\approx \exp[\theta_2+\theta_3 x]\\ \log\left( \frac{y/\theta_1}{1-y/\theta_1} \right) &\approx \theta_2+\theta_3 x\\ \mathrm{logit}\left( \frac{y}{\theta_1} \right) &\approx \theta_2+\theta_3 x \end{aligned} \]

This type of mathematical manipulation can be used to find good starting parameters. Basically, the goal is to come up with something that can be fit with an OLS regression model to get initial values (there are other strategies, but this is a good one).

F&W Example

F&W have a data set of US population by year and fit an NLS model to it. Let’s try!

Data

data("USPop") ## the USPop data set ships with the carData package
plot(population ~ year, data=USPop, main="")
abline(lm(population ~ year, data=USPop))

OLS model (Starting Value)

Based on our derivation, we can use \(logit\left( \frac{y}{\theta_1} \right) \approx \theta_2+\theta_3 x\) to get our starting conditions. F&W pick 400 as a starting parameter for \(\theta_1\) (this is based on the US population of 308 million).

summary(ols<-lm(logit(population/400) ~ year, USPop))

Call:
lm(formula = logit(population/400) ~ year, data = USPop)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.23258 -0.11036 -0.02292  0.12850  0.19658 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.925e+01  8.757e-01  -56.24   <2e-16 ***
year         2.507e-02  4.618e-04   54.27   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1374 on 20 degrees of freedom
Multiple R-squared:  0.9933,    Adjusted R-squared:  0.9929 
F-statistic:  2946 on 1 and 20 DF,  p-value: < 2.2e-16
theta0<-c(400,coef(ols))

NLS Model

pop.mod <- nls(population ~ theta1/(1 + exp(-(theta2 + theta3*year))),
start=list(theta1 = theta0[1], theta2 = theta0[2], theta3 = theta0[3]),
data=USPop, trace=TRUE)
summary(pop.mod)

Formula: population ~ theta1/(1 + exp(-(theta2 + theta3 * year)))

Parameters:
         Estimate Std. Error t value Pr(>|t|)    
theta1 440.833286  35.000130   12.60 1.14e-10 ***
theta2 -42.706981   1.839138  -23.22 2.08e-15 ***
theta3   0.021606   0.001007   21.45 8.87e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.909 on 19 degrees of freedom

Number of iterations to convergence: 6 
Achieved convergence tolerance: 1.965e-06
  • This estimates that the upper bound for the US population is 441 Million.

We can compute the CI straightforwardly,

Confint(pop.mod)
(profile-likelihood trace omitted)
           Estimate         2.5%        97.5%
theta1 440.83328575 381.49365542 536.93859220
theta2 -42.70698109 -46.58630995 -39.11083492
theta3   0.02160591   0.01961309   0.02371883

The residual standard error is \(\hat{\sigma} = \sqrt{S(\hat{\theta})/(n-k)}\). Let’s do this calculation in R!

S<-function(theta=c(1,1,1)){
  ## My code here!
  return(1)
}
n<-NROW(USPop)
k<-3
sigma_hat<-sqrt(S(c(1,1,1))/(n-k))
sigma_hat
[1] 0.2294157
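As a sketch of what the S function might look like for the logistic growth model, here is a self-contained version using simulated data in place of USPop (the parameter values, noise level, and seed are made up for illustration):

```r
## Simulate logistic growth with noise, then compute S(theta) and sigma_hat
set.seed(1)
year <- seq(1790, 2000, by = 10)
theta_true <- c(400, -42, 0.022)  ## made-up "true" parameters
pop <- theta_true[1] / (1 + exp(-(theta_true[2] + theta_true[3] * year))) +
  rnorm(length(year), sd = 4)

S <- function(theta) {
  m <- theta[1] / (1 + exp(-(theta[2] + theta[3] * year)))
  sum((pop - m)^2)                ## w = 1 case
}

n <- length(year)
k <- 3
sigma_hat <- sqrt(S(theta_true) / (n - k))
sigma_hat  ## roughly the noise sd (4) used in the simulation
```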
  • The estimated year in which the population is half the asymptote is \(-\hat{\theta}_2/\hat{\theta}_3 = 1976.6\).

We can get the SE for this ratio by the delta method (calculus approximation):

deltaMethod(pop.mod, "-theta2/theta3")
                Estimate        SE     2.5 % 97.5 %
-theta2/theta3 1976.6341    7.5558 1961.8250 1991.4

Which gives a SE of about 7.5 years.
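The delta-method arithmetic can be reproduced by hand. With \(g(\theta) = -\theta_2/\theta_3\), the approximation is \(SE(g) \approx \sqrt{\nabla g' V \nabla g}\). The point estimates below match the model output above, but the covariance matrix V is a hypothetical stand-in (the real one would come from vcov(pop.mod)):

```r
theta2 <- -42.70698  ## estimates from the fitted model above
theta3 <- 0.02160591
## Hypothetical covariance: the squared SEs from the output on the diagonal,
## plus a strong negative covariance (typical for these two parameters)
V <- matrix(c(1.839138^2, -1.85e-3,
              -1.85e-3,   0.001007^2), 2, 2)

g  <- -theta2 / theta3                   ## point estimate, about 1976.6
gr <- c(-1 / theta3, theta2 / theta3^2)  ## gradient of g w.r.t. (theta2, theta3)
se <- sqrt(drop(t(gr) %*% V %*% gr))
c(estimate = g, se = se)
```

With the real vcov(pop.mod) this reproduces deltaMethod’s SE of about 7.5 years; the hypothetical V here only gets the right ballpark.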

NLS Plots

par(mfrow=c(1,2))

plot(population ~ year, USPop, xlim=c(1790, 2100), ylim=c(0, 450))
with(USPop, lines(seq(1790, 2100, by=10),
predict(pop.mod, data.frame(year=seq(1790, 2100, by=10))), lwd=2))
points(2010, 308.745538, pch=16, cex=1.3)
abline(h=0, lty=2)
abline(h=coef(pop.mod)[1], lty=2)
abline(h=0.5*coef(pop.mod)[1], lty=2)
abline(v= -coef(pop.mod)[2]/coef(pop.mod)[3], lty=2)

with(USPop, plot(year, residuals(pop.mod), type='b'))
abline(h=0, lty=2)

NLS Prediction

Let’s check the quality of the predictions from F&W for 2001 to 2019. We can grab the data from Wikipedia’s Demographics of the United States page (which draws on US Census Bureau figures).

wiki<-"https://en.wikipedia.org/wiki/Demographics_of_the_United_States" %>% read_html() %>% html_table(fill=TRUE, header=TRUE)
us_pop<-wiki[[8]][,1:2]
names(us_pop)<-c("Year","Pop")
us_pop$Pop_num<-as.numeric(gsub(",","",us_pop$Pop))/1000000 ## Normalize to Millions

uspop01_19<-us_pop %>% filter(Year%in%2001:2019)
p_uspop01_19<-predict(pop.mod, data.frame(year=uspop01_19$Year)) ## one fitted value per observed year
data.frame(Year=uspop01_19$Year,TruePop=uspop01_19$Pop_num,PredPop=p_uspop01_19,diff=uspop01_19$Pop_num-p_uspop01_19)

NLS In-Class Exercise

Compute the NLS and LM models for the population of Seattle and California. Fit the models using data up to 2000 and predict 2001 to the present. Which performs better, the NLS growth model or the linear model?

NLS to Seattle

Data

wiki<-"https://en.wikipedia.org/wiki/Seattle" %>% read_html() %>% html_table(fill=TRUE, header=TRUE)
seattle_pop<-wiki[[3]]
seattle_pop<-seattle_pop[2:18,1:2]
names(seattle_pop)<-c("year","pop")
seattle_pop$year<-as.integer(gsub("Est.","",seattle_pop$year))
seattle_pop$pop_numeric<-as.numeric(gsub(",","",seattle_pop$pop))
seattle_pop
   year     pop pop_numeric
2  1860     188         188
3  1870   1,107        1107
4  1880   3,533        3533
5  1890  42,837       42837
6  1900  80,671       80671
7  1910 237,194      237194
8  1920 315,312      315312
9  1930 365,583      365583
10 1940 368,302      368302
11 1950 467,591      467591
12 1960 557,087      557087
13 1970 530,831      530831
14 1980 493,846      493846
15 1990 516,259      516259
16 2000 563,374      563374
17 2010 608,660      608660
18 2018 744,955      744955

Plot

plot(pop_numeric~year,data=seattle_pop)
abline(lm(pop_numeric ~ year, data=seattle_pop))

NLS to California

Data

wiki<-"https://en.wikipedia.org/wiki/Demographics_of_California" %>% read_html() %>% html_table(fill=TRUE, header=TRUE)
california_pop<-wiki[[8]]
california_pop<-california_pop[,1:2]
names(california_pop)<-c("year","pop")
california_pop$year<-as.integer(california_pop$year)
california_pop$pop_numeric<-as.numeric(gsub(",","",california_pop$pop))/1000
head(california_pop)
  year       pop pop_numeric
1 1909 2,282,000        2282
2 1910 2,406,000        2406
3 1911 2,534,000        2534
4 1912 2,668,000        2668
5 1913 2,811,000        2811
6 1914 2,934,000        2934

Plot

plot(log(pop_numeric)~year,data=california_pop)
abline(lm(log(pop_numeric) ~ year, data=california_pop))